<!DOCTYPE html>
Introduction This procedure analyzes the RNA-seq data of multiple samples for quality control purpose, using the gene-level read counts. The read counts can be provided as one or multiple integer data matrixes, corresponding to different mapping types, such as unique vs. multiple mapping and sense vs. antisense strands. The following steps will be applied to the read count data:
Transcriptome in immune cells of control-patient samples
Rna-seq data was generated from of 3 types of immune cells of 3 controls and 3 patients. Raw data was processed to get gene-level read counts.
The total read count per sample is the sum of sequence reads of one sample mapped to all genes. Inconsistency of total reads between samples might suggest data quality issues and affect downstream analysis. For example, low total read count could be caused by high level of RNA degradation or high rate of sequencing errors. Shapiro-Wilk normality test shows that the total read counts of this data set is normally distributed (p = 0.21). Samples with extreme total read counts comparing to the others:
Figure 1. Distribution of total read counts of all samples. Total read count per sample (millions):
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 6.446 7.491 8.600 8.863 10.050 12.370
Due to the variability of gene length and expression level, it is expected that a large portion of the sequence reads are mapped to a small number of genes. At the same time, a large portion of the genes will have no or few reads mapped to them, making them unsuitable for statistical analysis. The distribution of read counts across genes and the consistency of such distribution between samples also provide information about RNA-seq data quality. In this data set,
Figure 2. The total read count of all samples mapped to each gene is used to sort genes from high to low. The cumulative percent of top genes is plotted along gene ranking.
Figure 3. The cumulative read counts of top 10 genes and the percent of total reads by these genes are calculated for each sample. Samples are ordered by the percents from low to high on the x-axis in this figure. The 2 samples with the lowest and the highest cumulative percents are labeled. The cumulative percents of all samples are then compared to each other to identify samples with extremely lower or higher percents than the other samples:
The read-to-gene mapping could be complicated by at least 3 conditions corresponding to 8 mapping types:
This analysis accepts multiple matching matrixes of read counts corresponding to different mapping types. By default, the first matrix corresponds to the most common mapping type and will be used for all the other analyses in this report. In this section, however, read counts of different mapping types are summarized and compared to each other when multiple matrixes are provided.
Figure 4. Total mapped reads of all samples are split by mapping types, such as unique vs. multiple, paired vs. unpaired, and sense vs. antisense. Among the 8 mapping type(s),
When the gene-level read counts of two mapping types are strongly correlated to each other, they can be combined to increase total read counts and hence statistical power of data analysis. Negative or lack of correlation between mapping types might also provide useful information.
Figure 5. The mapping type has the strongest correlation to the first and most common mapping type is identified based on Spearman’s correlation coefficients. The gene-level read counts of both mapping types are plotted in this figure. Click here to view correlation coefficients between all pairs of mapping types.
By comparing read counts of different mapping types to each other, the ratios can be compared between samples for consistency. A sample might have quality issue if it has reads of a mapping type much less or more than the other samples.
Figure 6. The relative frequecy of mapping types are calculated for each sample, using the first read count matrix as reference. The frequency of each mapping type are normalized across all samples (Mean = 0 and SD = 1.0). This figure shows the normalized frequency of each mapping type in each sample (red = high). - Mapping type Multiple_Unpaired_Sense has the most “abnormally” high relative frequency (# of standard deviations = 0.42) in sample T_Patient_F, B_Control_C. - Mapping type Unique_Unpaired_As has the most “abnormally” low relative frequency (# of standard deviations = -1) in sample T_Patient_F.
Gene-specific dispersion of RNA-seq read counts is commonly used to evaluate between-sample variance of a data set. Here, the dispersion is estimated by the overall pattern of coefficient of variation (standard deviation devided by average read count) of all genes.
Figure 7. Left: the distribution of average gene expression level (read count divided by gene length). Right: the dispersion of gene expression level between samples, measured as the coefficient of variation. Usually, genes with lower expression will have larger dispersion on average.
There are many ways to normalize RNA-seq data to remove systematic bias between samples, usually based on the assumption that all the samples have the same global distribution. The following analysis performs several different normalization methods and evaluate their impact on data dispersion:
Check this paper for detailed comparison of normalization methods, and this function for R code of these normalization methods.
| Method | Mean | Change(%) | Corr2Original | Num_Decrease | Num_Increase | Decrease/Increase | |
|---|---|---|---|---|---|---|---|
| Original | Original | 1.1953 | 0.0000 | 1.0000 | 0 | 0 | NaN |
| CL | CL | 0.9093 | -23.9246 | 0.9700 | 20544 | 2010 | 10.22 |
| DESeq | DESeq | 1.1670 | -2.3660 | 0.9900 | 14207 | 8032 | 1.77 |
| FPKM | FPKM | 1.2474 | 4.3598 | 1.0000 | 5253 | 16590 | 0.32 |
| Linear | Linear | 0.9567 | -19.9642 | 0.9800 | 19647 | 2907 | 6.76 |
| Loess | Loess | 0.9071 | -24.1097 | 0.9800 | 20680 | 1874 | 11.04 |
| Median | Median | 1.1731 | -1.8580 | 0.9900 | 13328 | 8561 | 1.56 |
| 1.1406 | -4.5780 | 0.9900 | 15837 | 5982 | 2.65 | ||
| RLE | RLE | 1.1418 | -4.4743 | 1.0000 | 16415 | 6139 | 2.67 |
| Rlog | Rlog | 0.2918 | -75.5911 | -0.0034 | 22084 | 470 | 46.99 |
| TMM | TMM | 1.1421 | -4.4503 | 1.0000 | 16538 | 6016 | 2.75 |
| TC | TC | 1.2474 | 4.3598 | 1.0000 | 5372 | 16591 | 0.32 |
| TPM | TPM | 1.2744 | 6.6166 | 1.0000 | 4349 | 17477 | 0.25 |
| UQ | UQ | 1.1544 | -3.4250 | 1.0000 | 13455 | 7861 | 1.71 |
| VST | VST | 0.1757 | -85.3030 | 0.4800 | 22551 | 3 | 7517.00 |
Figure 9. Each boxplot corresponds to a normalization method and the overall distribution of change in coefficient of variation of all genes. The means of all genes are labeled as the read stars. Click here to view the table of coefficients of variation of all genes.
Table 1. Summary of coefficient of variation (CV) from each normalization method. The columns are 1) average CV of all genes; 2) correlation of CV between the original read counts and normalized read counts of all genes; 3) the number of gene with decreased CV; 4) the number of genes with increased CV; and 5) the ratio of decreased to increased genes.
| Method | Mean | Change(%) | Corr2Original | Num_Decrease | Num_Increase | Decrease/Increase | |
|---|---|---|---|---|---|---|---|
| Original | Original | 1.1953 | 0.0000 | 1.0000 | 0 | 0 | NaN |
| CL | CL | 0.9093 | -23.9246 | 0.9700 | 20544 | 2010 | 10.22 |
| DESeq | DESeq | 1.1670 | -2.3660 | 0.9900 | 14207 | 8032 | 1.77 |
| FPKM | FPKM | 1.2474 | 4.3598 | 1.0000 | 5253 | 16590 | 0.32 |
| Linear | Linear | 0.9567 | -19.9642 | 0.9800 | 19647 | 2907 | 6.76 |
| Loess | Loess | 0.9071 | -24.1097 | 0.9800 | 20680 | 1874 | 11.04 |
| Median | Median | 1.1731 | -1.8580 | 0.9900 | 13328 | 8561 | 1.56 |
| 1.1406 | -4.5780 | 0.9900 | 15837 | 5982 | 2.65 | ||
| RLE | RLE | 1.1418 | -4.4743 | 1.0000 | 16415 | 6139 | 2.67 |
| Rlog | Rlog | 0.2918 | -75.5911 | -0.0034 | 22084 | 470 | 46.99 |
| TMM | TMM | 1.1421 | -4.4503 | 1.0000 | 16538 | 6016 | 2.75 |
| TC | TC | 1.2474 | 4.3598 | 1.0000 | 5372 | 16591 | 0.32 |
| TPM | TPM | 1.2744 | 6.6166 | 1.0000 | 4349 | 17477 | 0.25 |
| UQ | UQ | 1.1544 | -3.4250 | 1.0000 | 13455 | 7861 | 1.71 |
| VST | VST | 0.1757 | -85.3030 | 0.4800 | 22551 | 3 | 7517.00 |
This section uses read count data to perform several sample-level analyses. The results can be used to identify potential mislabeling, outlier, confounding variable, etc. Read counts normalized by the Loess method are used for all analyses in this section.
When enough genes on X and Y chromosomes have detectable expression, these genes can be used to predict the gender of samples. In case the gender information is already available, the predicted gender can be used to identify potential mislabeling.
Figure 10. The average of Log2(Read Count+1) is calculated for X and Y chromosomes when there are at least 5 genes having detectable expression. The averages of each sample are plotted in this figure as X and Y axes. The averages of each chromosome are used separately to cluster samples using the pamk {fpc} function. If the optimal number of clusters is 2 for both chromosomes, gender prediction will be made, and the samples having predicted gender agreed byt the two chromosomes will be labeled with red (female) or blue (male). Click here to veiw the full table of gender prediction.
Hierarchical clustering groups samples based on their glabal correlation of all genes. It is a iterative procedure that merges two most similar samples/groups at each step.
Figure 11. Hierarchical clustering of samples using all autosomal genes. Normalized read counts are log2-transformed before clustering. The vertical position of the common node of two samples indicates their similarity (lower = more similar).
Principal components analysis
Figure 12. PCA is performed using all autosomal genes after read counts are log2-transformed. The same plot is made in multiple tabs while samples are color-coded based on known sample features.
Check out the RoCA home page for more information.
To reproduce this report:
Find the data analysis template you want to use and an example of its pairing YAML file here and download the YAML example to your working directory
To generate a new report using your own input data and parameter, edit the following items in the YAML file:
if (!require(devtools)) { install.packages('devtools'); require(devtools); }
if (!require(RCurl)) { install.packages('RCurl'); require(RCurl); }
if (!require(RoCA)) { install_github('zhezhangsh/RoCAR'); require(RoCA); }
CreateReport(filename.yaml); # filename.yaml is the YAML file you just downloaded and edited for your analysis
If there is no complaint, go to the output folder and open the index.html file to view report.
## R version 3.3.3 (2017-03-06)
## Platform: x86_64-apple-darwin13.4.0 (64-bit)
## Running under: OS X Yosemite 10.10.5
##
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
##
## attached base packages:
## [1] parallel stats4 stats graphics grDevices utils datasets
## [8] methods base
##
## other attached packages:
## [1] edgeR_3.16.5 DESeq2_1.14.1
## [3] SummarizedExperiment_1.4.0 Biobase_2.34.0
## [5] limma_3.30.13 gplots_3.0.1
## [7] DEGandMore_0.0.0.9000 snow_0.4-2
## [9] awsomics_0.0.0.9000 Rnaseq_0.0.0.9000
## [11] GenomicRanges_1.26.4 GenomeInfoDb_1.10.3
## [13] IRanges_2.8.2 S4Vectors_0.12.2
## [15] BiocGenerics_0.20.0 fpc_2.1-10
## [17] vioplot_0.2 sm_2.2-5.4
## [19] htmlwidgets_0.9 DT_0.2
## [21] yaml_2.1.16 knitr_1.18
## [23] rmarkdown_1.10.3 RoCA_0.0.0.9000
## [25] RCurl_1.95-4.9 bitops_1.0-6
## [27] devtools_1.13.4 kableExtra_0.9.0
##
## loaded via a namespace (and not attached):
## [1] bit64_0.9-7 RColorBrewer_1.1-2 httr_1.3.1
## [4] rprojroot_1.3-2 prabclus_2.2-6 tools_3.3.3
## [7] backports_1.1.2 R6_2.2.2 rpart_4.1-10
## [10] KernSmooth_2.23-15 DBI_0.7 Hmisc_4.1-0
## [13] lazyeval_0.2.1 colorspace_1.3-2 trimcluster_0.1-2
## [16] nnet_7.3-12 withr_2.1.1 gridExtra_2.3
## [19] bit_1.1-12 rvest_0.3.2 htmlTable_1.11.1
## [22] xml2_1.1.1 checkmate_1.8.5 diptest_0.75-7
## [25] caTools_1.17.1 scales_0.5.0 DEoptimR_1.0-8
## [28] mvtnorm_1.0-6 robustbase_0.92-8 genefilter_1.56.0
## [31] readr_1.1.1 stringr_1.2.0 digest_0.6.13
## [34] foreign_0.8-67 XVector_0.14.1 base64enc_0.1-3
## [37] pkgconfig_2.0.1 htmltools_0.3.6 highr_0.6
## [40] rlang_0.1.6 RSQLite_2.0 rstudioapi_0.7
## [43] jsonlite_1.5 mclust_5.4 BiocParallel_1.8.2
## [46] gtools_3.5.0 acepack_1.4.1 magrittr_1.5
## [49] modeltools_0.2-21 Formula_1.2-2 Matrix_1.2-8
## [52] Rcpp_0.12.14 munsell_0.4.3 stringi_1.1.6
## [55] MASS_7.3-45 zlibbioc_1.20.0 flexmix_2.3-14
## [58] plyr_1.8.4 blob_1.1.0 grid_3.3.3
## [61] gdata_2.18.0 lattice_0.20-34 splines_3.3.3
## [64] annotate_1.52.1 hms_0.4.0 locfit_1.5-9.1
## [67] pillar_1.1.0 geneplotter_1.52.0 XML_3.98-1.9
## [70] evaluate_0.10.1 latticeExtra_0.6-28 data.table_1.10.4-3
## [73] gtable_0.2.0 kernlab_0.9-25 ggplot2_2.2.1
## [76] xtable_1.8-2 class_7.3-14 survival_2.40-1
## [79] viridisLite_0.2.0 tibble_1.4.2 AnnotationDbi_1.36.2
## [82] memoise_1.1.0 cluster_2.0.5
END OF DOCUMENT